226 research outputs found
The skillful interrogation of the Internet
As SIGCOMM turns 50, it's interesting to ask how networking research has evolved over time. This is a set of personal observations about the "mindset" associated with Internet research. Accepted manuscript
Estimation of intrinsic dimension via clustering
The problem of estimating the intrinsic dimension of a set of points in high-dimensional space is a critical issue for a wide range of disciplines, including genomics, finance, and networking. Current estimation techniques have computational complexity that depends on either the ambient or the intrinsic dimension, which may cause these methods to become intractable for large data sets. In this paper, we present a clustering-based methodology that exploits the inherent self-similarity of data to efficiently estimate the intrinsic dimension of a set of points. When the data satisfies a specified general clustering condition, we prove that the estimated dimension approaches the true Hausdorff dimension. Experiments show that the clustering-based approach allows for more efficient and accurate intrinsic dimension estimation compared with all prior techniques, even when the data does not conform to obvious self-similarity structure. Finally, we present empirical results which show that clustering-based estimation allows for a natural partitioning of the data points that lie on separate manifolds of varying intrinsic dimension.
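The paper's clustering-based estimator is not reproduced here, but the quantity it targets can be illustrated with the classic correlation-dimension approach, which likewise exploits self-similarity: count how quickly the fraction of point pairs within distance r grows with r. The radii and the synthetic manifold below are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(points, r1, r2):
    """Estimate intrinsic dimension from the growth rate of the
    correlation integral C(r) between two radii r1 < r2."""
    dists = pdist(points)          # all pairwise Euclidean distances
    c1 = np.mean(dists < r1)       # fraction of pairs within r1
    c2 = np.mean(dists < r2)       # fraction of pairs within r2
    return np.log(c2 / c1) / np.log(r2 / r1)

rng = np.random.default_rng(0)
# a 2-D manifold (a flat unit square) embedded in 10-D ambient space
x = rng.uniform(size=(1500, 2))
pts = np.concatenate([x, np.zeros((1500, 8))], axis=1)
dim = correlation_dimension(pts, 0.1, 0.3)   # close to 2, not 10
```

Note that the cost here is quadratic in the number of points; the abstract's motivation is precisely that such complexity becomes intractable at scale, which the clustering-based method is designed to avoid.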
Unravelling the dynamics of online ratings
Online product ratings are an immensely important source of information for consumers and accordingly a strong driver of commerce. Nonetheless, interpreting a particular rating in context can be very challenging. Ratings show significant variation over time, so understanding the reasons behind that variation is important for consumers, platform designers, and product creators. In this paper we contribute a set of tools and results that help shed light on the complexity of ratings dynamics. We consider multiple item types across multiple ratings platforms, and use an interpretable model to decompose ratings in a manner that facilitates comprehensibility. We show that the various kinds of dynamics observed in online ratings are largely understandable as a product of the nature of the ratings platform, the characteristics of the user population, known trends in ratings behavior, and the influence of recommendation systems. Taken together, these results provide a framework for both quantifying and interpreting the factors that drive the dynamics of online ratings. Published version
The Network Effects of Prefetching
Prefetching has been shown to be an effective technique for reducing user-perceived latency in distributed systems. In this paper we show that even when prefetching adds no extra traffic to the network, it can have serious negative performance effects. Straightforward approaches to prefetching increase the burstiness of individual sources, leading to increased average queue sizes in network switches. However, we also show that applications can avoid the undesirable queueing effects of prefetching. In fact, we show that applications employing prefetching can significantly improve network performance, to a level much better than that obtained without any prefetching at all. This is because prefetching offers increased opportunities for traffic shaping that are not available in the absence of prefetching. Using a simple transport rate control mechanism, a prefetching application can modify its behavior from a distinctly ON/OFF entity to one whose data transfer rate changes less abruptly, while still delivering all data in advance of the user's actual requests.
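The shaping argument can be made concrete with a toy calculation (illustrative numbers only, not from the paper): sending each object as a short burst at request time produces a high peak rate, whereas a prefetching sender can spread each object's bytes over the entire interval before its deadline and still deliver on time.

```python
import numpy as np

rng = np.random.default_rng(1)
# hypothetical workload: 20 objects with sizes (KB) and request times (s)
sizes = rng.integers(200, 800, size=20).astype(float)
times = np.cumsum(rng.uniform(1.0, 5.0, size=20))

# ON/OFF on-demand delivery: each object sent in a 0.5 s burst
# at its request time
peak_on_off = (sizes / 0.5).max()             # peak rate in KB/s

# prefetching with shaping: spread each object's bytes evenly over
# the whole interval preceding its request, arriving just in time
lead = np.diff(np.concatenate([[0.0], times]))  # seconds available
peak_prefetch = (sizes / lead).max()
```

Because every lead interval here exceeds the burst duration, the smoothed schedule's peak rate is strictly lower, which is the source of the reduced queueing the abstract describes.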
Exploring explanations for matrix factorization recommender systems (Position Paper)
In this paper we address the problem of finding explanations for collaborative filtering algorithms that use matrix factorization methods. We look for explanations that increase the transparency of the system. To do so, we propose two measures. First, we show a model that describes the contribution of each previous rating given by a user to the generated recommendation. Second, we measure the influence of changing each previous rating of a user on the outcome of the recommender system. We show that under the assumption that there are many more users in the system than there are items, we can efficiently generate each type of explanation by using linear approximations of the recommender system’s behavior for each user, and computing partial derivatives of predicted ratings with respect to each user’s provided ratings. http://scholarworks.boisestate.edu/fatrec/2017/1/7/ Published version
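A minimal sketch of the partial-derivative idea, under one common simplifying assumption (item factors held fixed, with the user vector obtained by ridge regression on the user's observed ratings): the prediction is then exactly linear in the user's ratings, so its gradient gives the influence of each past rating. The dimensions and regularization constant below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_items = 4, 12
V = rng.normal(size=(k, n_items))    # item factors, assumed fixed
lam = 0.1                            # ridge regularization

observed = np.array([0, 2, 5, 7])    # items this user has rated
r = np.array([4.0, 3.0, 5.0, 2.0])   # the user's ratings

Vs = V[:, observed]
A = np.linalg.inv(Vs @ Vs.T + lam * np.eye(k))
u = A @ Vs @ r                       # ridge-regression user vector

target = 9                           # item whose prediction we explain
pred = V[:, target] @ u
# influence of each observed rating on the prediction: the partial
# derivatives d(pred)/d(r_i), exact here since pred is linear in r
influence = V[:, target] @ A @ Vs
```

Since the map from ratings to prediction is linear, `influence @ r` reconstructs the prediction exactly, which is what makes this kind of explanation interpretable as a per-rating contribution.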
Describing and Forecasting Video Access Patterns
Computer systems are increasingly driven by workloads that reflect large-scale social behavior, such as rapid changes in the popularity of media items like videos. Capacity planners and system designers must plan for rapid, massive changes in workloads when such social behavior is a factor. In this paper we make two contributions intended to assist in the design and provisioning of such systems. We analyze an extensive dataset consisting of the daily access counts of hundreds of thousands of YouTube videos. In this dataset, we find that there are two types of videos: those that show rapid changes in popularity, and those that are consistently popular over long time periods. We call these two types rarely-accessed and frequently-accessed videos, respectively. We observe that most of the videos in our data set clearly fall in one of these two types. For each type of video we ask two questions: first, are there relatively simple models that can describe its daily access patterns? And second, can we use these simple models to predict the number of accesses that a video will have in the near future, as a tool for capacity planning? To answer these questions we develop two different frameworks for characterization and forecasting of access patterns. We show that for frequently-accessed videos, daily access patterns can be extracted via principal component analysis, and used efficiently for forecasting. For rarely-accessed videos, we demonstrate a clustering method that allows one to classify bursts of popularity and use those classifications for forecasting.
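The PCA step for frequently-accessed videos can be sketched on synthetic data (the shared temporal patterns below are invented stand-ins, not the paper's components): when many videos' daily counts are mixtures of a few common temporal patterns, a handful of principal components captures most of the variation, giving a compact description to forecast from.

```python
import numpy as np

rng = np.random.default_rng(2)
days = np.arange(60)
# two hypothetical shared patterns: a weekly cycle and a slow decay
weekly = 1.0 + 0.5 * np.sin(2 * np.pi * days / 7)
decay = np.exp(-days / 40)

# 200 synthetic "frequently-accessed" videos: random mixes of the
# two patterns plus a little noise
w = rng.uniform(0.5, 2.0, size=(200, 2))
counts = w @ np.vstack([weekly, decay]) + 0.05 * rng.normal(size=(200, 60))

# principal components of the videos-by-days access matrix
X = counts - counts.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
explained = (s[:2] ** 2).sum() / (s ** 2).sum()   # variance captured
```

With two dominant components, each video reduces to two loadings whose trajectories are far easier to extrapolate for capacity planning than the raw 60-day series.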
Targeted matrix completion
Matrix completion is a problem that arises in many data-analysis settings where the input consists of a partially-observed matrix (e.g., recommender systems, traffic matrix analysis, etc.). Classical approaches to matrix completion assume that the input partially-observed matrix is low rank. The success of these methods depends on the number of observed entries and the rank of the matrix; the larger the rank, the more entries need to be observed in order to accurately complete the matrix. In this paper, we deal with matrices that are not necessarily low rank themselves, but rather contain low-rank submatrices. We propose Targeted, a general framework for completing such matrices. In this framework, we first extract the low-rank submatrices and then apply a matrix-completion algorithm to these low-rank submatrices as well as to the remainder matrix separately. Although for the completion itself we use state-of-the-art completion methods, our results demonstrate that Targeted achieves significantly smaller reconstruction errors than other classical matrix-completion methods. One of the key technical contributions of the paper lies in the identification of the low-rank submatrices from the input partially-observed matrices. Comment: Proceedings of the 2017 SIAM International Conference on Data Mining (SDM)
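Once a low-rank submatrix has been extracted, any standard completion routine can fill it in; the sketch below uses a common baseline (iterative hard-thresholded SVD imputation), not the paper's extraction algorithm, on a synthetic rank-2 submatrix with 60% of entries observed.

```python
import numpy as np

def complete_low_rank(M, mask, rank, iters=200):
    """Iterative SVD imputation: repeatedly project onto rank-r
    matrices while keeping the observed entries fixed. A standard
    baseline, not the Targeted framework itself."""
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X = np.where(mask, M, low)     # restore observed entries
    return X

rng = np.random.default_rng(3)
# a rank-2 submatrix of the kind Targeted would extract
A = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
mask = rng.uniform(size=A.shape) < 0.6    # 60% of entries observed
Ahat = complete_low_rank(A, mask, rank=2)
err = np.linalg.norm(Ahat - A) / np.linalg.norm(A)
```

Run on the full matrix, the same routine would need a much larger rank (and many more observations); running it separately on the low-rank pieces is the point of the framework.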
Matrix completion with queries
In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed and low rank. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. Even when the data satisfies the low-rank assumption, classical matrix-completion methods may output completions with significant error -- in that the reconstructed matrix differs significantly from the true underlying matrix. Often, this is because the information contained in the observed entries is insufficient. In this work, we address this problem by proposing an active version of matrix completion, where queries can be made to the true underlying matrix. Subsequently, we design Order&Extend, the first algorithm to unify a matrix-completion approach and a querying strategy into a single algorithm. Order&Extend is able to identify and alleviate insufficient information by judiciously querying a small number of additional entries. In an extensive experimental evaluation on real-world datasets, we demonstrate that our algorithm is efficient and is able to accurately reconstruct the true matrix while asking only a small number of queries. Comment: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
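The flavor of "identifying insufficient information" can be shown with a deliberately simple querying rule (hypothetical, much cruder than Order&Extend): a rank-r completion cannot pin down a column that has fewer than r observed entries, so such columns are natural places to spend queries.

```python
import numpy as np

def query_deficient_columns(mask, rank, rng):
    """Hypothetical querying rule, not Order&Extend: request extra
    entries in any column with fewer than `rank` observations."""
    queries = []
    for j in range(mask.shape[1]):
        missing = np.flatnonzero(~mask[:, j])
        deficit = rank - int(mask[:, j].sum())
        if deficit > 0:
            for i in rng.choice(missing, size=deficit, replace=False):
                queries.append((int(i), j))
    return queries

rng = np.random.default_rng(4)
mask = rng.uniform(size=(15, 10)) < 0.3   # sparsely observed matrix
queries = query_deficient_columns(mask, rank=3, rng=rng)
for i, j in queries:        # "ask" the true matrix for these entries
    mask[i, j] = True
```

After querying, every column supports a rank-3 fit; Order&Extend's actual contribution is choosing an ordering and a far smaller, better-targeted set of queries than this column-counting heuristic.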
On the Intrinsic Locality Properties of Web Reference Streams
There has been considerable work done in the study of Web reference streams: sequences of requests for Web objects. In particular, many studies have looked at the locality properties of such streams, because of the impact of locality on the design and performance of caching and prefetching systems. However, a general framework for understanding why reference streams exhibit given locality properties has not yet emerged.
In this work we take a first step in this direction, based on viewing the Web as a set of reference streams that are transformed by Web components (clients, servers, and intermediaries). We propose a graph-based framework for describing this collection of streams and components. We identify three basic stream transformations that occur at nodes of the graph: aggregation, disaggregation and filtering, and we show how these transformations can be used to abstract the effects of different Web components on their associated reference streams. This view allows a structured approach to the analysis of why reference streams show given properties at different points in the Web.
Applying this approach to the study of locality requires good metrics for locality. These metrics must meet three criteria: 1) they must accurately capture temporal locality; 2) they must be independent of trace artifacts such as trace length; and 3) they must not involve manual procedures or model-based assumptions. We describe two metrics meeting these criteria that each capture a different kind of temporal locality in reference streams. The popularity component of temporal locality is captured by entropy, while the correlation component is captured by interreference coefficient of variation. We argue that these metrics are more natural and more useful than previously proposed metrics for temporal locality.
We use this framework to analyze a diverse set of Web reference traces. We find that this framework can shed light on how and why locality properties vary across different locations in the Web topology. For example, we find that filtering and aggregation have opposing effects on the popularity component of the temporal locality, which helps to explain why multilevel caching can be effective in the Web. Furthermore, we find that all transformations tend to diminish the correlation component of temporal locality, which has implications for the utility of different cache replacement policies at different points in the Web. National Science Foundation (ANI-9986397, ANI-0095988); CNPq-Brazil
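The two metrics named above are straightforward to compute from a trace; the sketch below shows plausible implementations on a synthetic Zipf-like reference stream (the normalization and pooling choices are illustrative assumptions, not taken verbatim from the paper).

```python
import numpy as np
from collections import Counter

def popularity_entropy(stream):
    """Normalized entropy of object popularity: near 0 when one
    object dominates, near 1 when all objects are equally popular."""
    counts = np.array(list(Counter(stream).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() / np.log2(len(p)))

def interreference_cv(stream):
    """Coefficient of variation of gaps between successive references
    to the same object, pooled over objects (the correlation part)."""
    last, gaps = {}, []
    for t, obj in enumerate(stream):
        if obj in last:
            gaps.append(t - last[obj])
        last[obj] = t
    gaps = np.array(gaps, dtype=float)
    return float(gaps.std() / gaps.mean())

rng = np.random.default_rng(5)
# skewed popularity over 50 objects: p(rank k) proportional to 1/k
p = 1.0 / np.arange(1, 51)
p /= p.sum()
objs = rng.choice(50, size=5000, p=p)
H = popularity_entropy(objs)      # below 1 due to the skew
cv = interreference_cv(objs)
```

Entropy isolates the popularity skew while the interreference CV isolates temporal correlation, which is why the two together can track how aggregation, disaggregation, and filtering each transform a stream.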
Network Kriging
Network service providers and customers are often concerned with aggregate performance measures that span multiple network paths. Unfortunately, forming such network-wide measures can be difficult, due to the issues of scale involved. In particular, the number of paths grows too rapidly with the number of endpoints to make exhaustive measurement practical. As a result, it is of interest to explore the feasibility of methods that dramatically reduce the number of paths measured in such situations while maintaining acceptable accuracy.
We cast the problem as one of statistical prediction--in the spirit of the so-called `kriging' problem in spatial statistics--and show that end-to-end network properties may be accurately predicted in many cases using a surprisingly small set of carefully chosen paths. More precisely, we formulate a general framework for the prediction problem, propose a class of linear predictors for standard quantities of interest (e.g., averages, totals, differences), and show that linear algebraic methods of subset selection may be used to effectively choose which paths to measure. We characterize the performance of the resulting methods, both analytically and numerically. The success of our methods derives from the low effective rank of routing matrices as encountered in practice, which appears to be a new observation in its own right, with potentially broad implications for network measurement generally. Comment: 16 pages, 9 figures, single-spaced